7. The Theory of Probability 



7.1 Introduction 

In investigating weighing, sorting, and finding an efficient code, you may have noted a common thread.

In each case, and in many others that we will encounter later, we start with a (usually large) 
number of possible situations, and are interested in determining the true situation from among them. 
Thus we have devised schemes for doing so in the contexts mentioned. 

Probability theory, which by the way was developed initially in the study of gambling 
schemes, is intended to provide a framework for modeling problems of this kind, with the aim of 
providing information about what we can "expect" to happen in each context. 

This is done by defining a Sample Space whose points or elements are the possible situations, 
each one of which is given a weight or probability according to the proportion of the time we expect it 
to be the true situation. 

Then we determine what to "expect" as the answer to some question, by representing the 
answer as a function defined on the points of the sample space, and averaging that function over them. 

In most contexts, including gambling and all that we have considered so far, the potential 
situations can be represented by strings of binary bits (or ternary symbols, in the case of weighing). In 
the following discussion we assume that our sample space consists of points each of which is such a 
bit string. 

The words used in probability theory are as follows: 

7.2 Basic Definitions and Facts 
An event corresponds to a subset of the sample space; here are some examples: 

1. the second bit of our bit string is a 1.

2. half the bits are 1 (obviously only possible when the string is finite and has an even length) 

3. any other property of points in the sample space. 

The subset corresponding to an event E consists of the points of the sample space which 
have the property that defines E. 

A random variable is the name we give to a function defined on the points of the sample space. Here are some examples:

1. the number of 1's in the bit string

2. the position of the first 1 in the string 

3. how much you would win in your betting scheme if the given bit string came true 

4. any other (usually numeric) function defined on strings 

A particularly useful kind of random variable is an indicator random variable for an event
E. It is 1 on strings in the event set, and 0 elsewhere. We can call it I(E).



As a sort of converse of this definition, given any random variable, say f, we can, for any of its 
values, x, define E(f=x) to be the event that occurs when f(p)=x holds (and only then). 

We can then write

f = SUM(over all x) x*I(E(f=x))

Each point q of the sample space, that is, each possible bit string q, is given a weight, called its
probability, which is usually written as p(q). These probabilities are required to be non-negative
numbers whose sum, over the entire sample space, is 1.

A uniform sample space is one in which all points have the same weight, which is therefore the
reciprocal, 1/N, of the number, N, of points in the space.

Each event E is given a probability, p(E), which is the sum, over all strings in its set, of 
their individual probabilities, p(q). 

The expectation of a random variable f is the sum over the points q of the sample space of 
f(q)*p(q). 

This expectation is often denoted as f with a bar over it, or as <f>. It is often called the mean
or average of the random variable f.
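
The definitions so far can be tried out directly on a small sample space. The following is a minimal sketch, assuming a uniform sample space of all 4-bit strings; the length 4 and the helper names (prob, expect, indicator) are choices made here for illustration, not part of the text.

```python
from fractions import Fraction
from itertools import product

# Assumption for illustration: a uniform sample space of all 4-bit strings.
N_BITS = 4
space = ["".join(bits) for bits in product("01", repeat=N_BITS)]
p = {q: Fraction(1, len(space)) for q in space}   # each point has weight 1/N


def prob(event):
    """p(E): sum of p(q) over the points q on which the event holds."""
    return sum(p[q] for q in space if event(q))


def expect(f):
    """Expectation of f: sum over all points q of f(q)*p(q)."""
    return sum(f(q) * p[q] for q in space)


def indicator(event):
    """I(E): the random variable that is 1 on E and 0 elsewhere."""
    return lambda q: 1 if event(q) else 0


def second_bit_is_1(q):       # event: the second bit is a 1
    return q[1] == "1"


def half_ones(q):             # event: half the bits are 1
    return q.count("1") == N_BITS // 2


def num_ones(q):              # random variable: the number of 1's
    return q.count("1")


print(prob(second_bit_is_1))                             # 1/2
print(prob(half_ones))                                   # 3/8
print(expect(num_ones))                                  # 2
print(expect(indicator(half_ones)) == prob(half_ones))   # True
```

Exact fractions are used rather than floating point so that the probabilities can be compared for equality.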

The variance of a random variable f is the difference between the expectation of its square
and the square of its expectation, or <f^2> - <f>^2. It is often denoted by σ^2(f) and called
sigma squared of f.
Another expression for the variance is <(f-<f>)^2>. <f> is often called the mean value of f, and f-<f> is
called the deviation of f from its mean, so that the variance can be called the "mean square
deviation of f from its mean". Sigma of f is then "the root mean square deviation of f".

The expectation of f is by definition linear in f. This means that the expectation of the sum of
af and bg for numbers a and b and random variables f and g obeys <af+bg> = a<f> + b<g>.

Notice that the variance of a function f is quadratic in f, and we have

σ^2(af) = a^2 σ^2(f),

when a is a constant. On the other hand the variance of a sum of two random variables is often much
smaller than its quadratic nature would suggest.
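
These identities are easy to check numerically. The sketch below assumes a uniform sample space of 3-bit strings and two random variables chosen only for illustration (the number of 1's and the value of the first bit). It compares the two expressions for the variance, and checks linearity of the expectation and the quadratic scaling of the variance.

```python
from fractions import Fraction
from itertools import product

# Assumption for illustration: uniform sample space of all 3-bit strings.
space = ["".join(bits) for bits in product("01", repeat=3)]
p = {q: Fraction(1, len(space)) for q in space}


def expect(f):
    """Expectation: sum over points q of f(q)*p(q)."""
    return sum(f(q) * p[q] for q in space)


def variance(f):
    """Variance as the expectation of the square minus the square of the expectation."""
    return expect(lambda q: f(q) ** 2) - expect(f) ** 2


def f(q):                 # number of 1's in the string
    return q.count("1")


def g(q):                 # value of the first bit
    return int(q[0])


mean_f = expect(f)
# Second form of the variance: the mean square deviation from the mean.
msd_f = expect(lambda q: (f(q) - mean_f) ** 2)

a, b = 2, 3
lhs = expect(lambda q: a * f(q) + b * g(q))     # expectation of af + bg
rhs = a * expect(f) + b * expect(g)             # a times mean of f plus b times mean of g

print(mean_f, variance(f), msd_f)               # 3/2 3/4 3/4
print(lhs == rhs)                               # True
print(variance(lambda q: a * f(q)) == a ** 2 * variance(f))   # True
```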

Notice that the expectation of an indicator random variable (whose values are all 0 or 1), is the
probability of its event.

Recall that an event E corresponds to a subset of our initial sample space, S. We define the 
conditional probability of any event Q given E to be the probability of Q in a new sample space 
defined to be the points on which E happens. 



This probability is usually written as p(Q|E). 



The probability of Q happening in general is the sum over all points p in S of p(p)*I(Q),
divided by the sum over all p of p(p), which latter is 1.

The conditional probability p(Q|E) has numerator given by the probability that both Q and E 
occur, and denominator given by the probability that E occurs. 

Thus the indicator random variable for the event Q given E is the product I(E)*I(Q), which is the 
indicator of the event that both Q and E take place. The probability of the event is the sum of this 
random variable multiplied by p(p) over all points p of E (or all points of S, same thing) divided by 
the sum of I(E)*p(p) over all p in E. 

The denominator is required here because we have reduced the entire sample space S to the points 
in E. The sum of p(p) for our old probabilities over our new sample space will not usually be 1. To get 
a probability distribution that sums to 1 while maintaining the relative probabilities of points in E, we 
must take the new probabilities of each point p in E to be p(p)/(SUM(over p in E of)p(p)). 

Perhaps the simplest example of this is as follows: Suppose we toss two fair coins and want the
probability that both are heads. This is one fourth. If we toss one and it is a head, then the probability
that both are heads, given this information, becomes 1/2. This is what the remarks of the preceding
paragraph imply.
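
The two-coin example can be worked through mechanically. This is a small sketch, with the function names prob and cond_prob invented here; it restricts attention to the points on which the given event happens by dividing the joint probability by the probability of that event.

```python
from fractions import Fraction
from itertools import product

space = list(product("HT", repeat=2))     # points are (first coin, second coin)
p = {q: Fraction(1, 4) for q in space}    # two fair coins: uniform weights


def prob(event):
    """Sum of p(q) over the points q on which the event holds."""
    return sum(p[q] for q in space if event(q))


def cond_prob(Q, E):
    """p(Q|E) = p(Q and E) / p(E); assumes p(E) > 0."""
    return prob(lambda q: Q(q) and E(q)) / prob(E)


def both_heads(q):
    return q == ("H", "H")


def first_is_head(q):
    return q[0] == "H"


def at_least_one_head(q):
    return "H" in q


print(prob(both_heads))                           # 1/4
print(cond_prob(both_heads, first_is_head))       # 1/2
print(cond_prob(both_heads, at_least_one_head))   # 1/3
```

The last line shows that the answer depends on exactly what information is given: learning that at least one coin is a head is weaker than learning that the first coin is a head.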

These remarks can be written in symbols as 

p(E|Q) = p(E &Q)/p(Q). 

There are analogous statements about random variables. You can speak of the probabilities that a
random variable f takes on each of its values, given that some event takes place.

On occasion you may want to express the probability of E given Q in terms of the probability of Q 
given E. 

From the last relation we can write:

p(E|Q) = p(E&Q)/p(Q) = (p(Q|E)/p(Q))*p(E).

In words, this says that the probability of E given Q differs from the probability of E at the 
beginning, by a factor that is the ratio of the probability of Q given E to the probability of Q at the 
beginning. 
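
As a check on this relation, here is a sketch on a uniform space of 3-bit strings. The events E (first bit is 1) and Q (at least two 1's) are arbitrary choices made for illustration.

```python
from fractions import Fraction
from itertools import product

# Assumption for illustration: uniform sample space of all 3-bit strings.
space = ["".join(bits) for bits in product("01", repeat=3)]
p = {q: Fraction(1, len(space)) for q in space}


def prob(event):
    return sum(p[q] for q in space if event(q))


def cond_prob(A, B):
    """p(A|B) = p(A&B)/p(B); assumes p(B) > 0."""
    return prob(lambda q: A(q) and B(q)) / prob(B)


def E(q):                       # the first bit is 1
    return q[0] == "1"


def Q(q):                       # at least two of the bits are 1
    return q.count("1") >= 2


direct = cond_prob(E, Q)                          # p(E|Q) from its definition
via_ratio = cond_prob(Q, E) / prob(Q) * prob(E)   # (p(Q|E)/p(Q)) * p(E)

print(prob(E), direct, via_ratio)                 # 1/2 3/4 3/4
```

Here learning Q raises the probability of E from 1/2 to 3/4, and the factor 3/2 is exactly p(Q|E)/p(Q).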

Two events are said to be mutually exclusive when both together happen on no point of our
sample space S.

We can deduce that the probability that either of two mutually exclusive events occurs is the
sum of the probabilities that each occurs.

If I(E)*I(Q)=0 then p(Q or E) = p(Q) + p(E).

More generally when Q and E can both occur simultaneously, the points of the sample space 
on which both occur are counted twice on the right hand side here. 



The general statement is therefore: 



P(Q or E) = p(Q) + p(E) - p(Q&E). 



We can write out a similar statement when we have three or more events.
With three events, the probability that any of them occurs is the sum of the probabilities that each one
occurs, less the sum over pairs of the probabilities that both events of the pair occur, and that will
correctly account for all cases in which at most two of them occur. However, when all three occur we
should have one contribution but have added three and subtracted three contributions, so we must add
one back again. We get

p(Q1 or Q2 or Q3) = SUM(over j) p(Qj) - SUM(over j<k) p(Qj&Qk) + p(Q1&Q2&Q3).
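
The three-event statement can be verified by brute force. The sketch below assumes a uniform space of 4-bit strings and three overlapping events picked arbitrarily for illustration; it computes both sides of the formula separately.

```python
from fractions import Fraction
from itertools import combinations, product

# Assumption for illustration: uniform sample space of all 4-bit strings.
space = ["".join(bits) for bits in product("01", repeat=4)]
p = {q: Fraction(1, len(space)) for q in space}


def prob(event):
    return sum(p[q] for q in space if event(q))


events = [
    lambda q: q[0] == "1",          # Q1: the first bit is 1
    lambda q: q.count("1") == 2,    # Q2: exactly two bits are 1
    lambda q: q[-1] == "0",         # Q3: the last bit is 0
]


def all_of(group):
    """The event that every event in the group occurs."""
    return lambda q: all(e(q) for e in group)


lhs = prob(lambda q: any(e(q) for e in events))     # p(Q1 or Q2 or Q3)
singles = sum(prob(e) for e in events)
pairs = sum(prob(all_of(pair)) for pair in combinations(events, 2))
triple = prob(all_of(events))
rhs = singles - pairs + triple

print(lhs, rhs)                                     # 7/8 7/8
```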

Exercise 7.1: What is the analogous statement for 4 events? For n events? 

Two events A and B are said to be independent if the probability of A given B, p(A|B), is
exactly the same as the initial probability, p(A).

We may deduce that the probability of both A and B happening, when A and B are independent, is the
product of their individual probabilities. For, the statement of independence is

p(A) = p(A|B) = p(A&B)/p(B),

and multiplying through by p(B) gives p(A&B) = p(A)*p(B).

Independence can also be defined for random variables. Two random variables f and g are
said to be independent when learning the value of g does not change the probabilities that f takes
on any of its values.

This means p(f=a) = p(f=a|g=b) = p(f=a&g=b)/p(g=b), or
p(f=a&g=b) = p(f=a)*p(g=b). Since I(f=a&g=b) = I(f=a)*I(g=b), this says that for independent
random variables we have

<I(f=a)*I(g=b)> = <I(f=a)>*<I(g=b)>.

If we multiply both sides here by a*b and sum over all a and b, we find that independence of the
random variables f and g implies

<f*g> = <f>*<g>,

the expectation of their product is the product of their expectations.
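
A short sketch of this product rule, assuming a uniform space of 2-bit strings. The variables f and g (the two bits) are independent, while h (the number of 1's) depends on the first bit; the helper name independent is invented here and simply tests the defining condition for every pair of values.

```python
from fractions import Fraction
from itertools import product

# Assumption for illustration: uniform sample space of all 2-bit strings.
space = ["".join(bits) for bits in product("01", repeat=2)]
p = {q: Fraction(1, len(space)) for q in space}


def prob(event):
    return sum(p[q] for q in space if event(q))


def expect(f):
    return sum(f(q) * p[q] for q in space)


def f(q):                       # first bit
    return int(q[0])


def g(q):                       # second bit
    return int(q[1])


def h(q):                       # number of 1's (depends on the first bit)
    return q.count("1")


def independent(u, v):
    """Check p(u=a & v=b) == p(u=a)*p(v=b) for every pair of values a, b."""
    u_vals = {u(q) for q in space}
    v_vals = {v(q) for q in space}
    return all(
        prob(lambda q: u(q) == a and v(q) == b)
        == prob(lambda q: u(q) == a) * prob(lambda q: v(q) == b)
        for a in u_vals
        for b in v_vals
    )


print(independent(f, g), expect(lambda q: f(q) * g(q)) == expect(f) * expect(g))  # True True
print(independent(f, h), expect(lambda q: f(q) * h(q)), expect(f) * expect(h))    # False 3/4 1/2
```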
We now ask the question: what is the variance of the sum of two random variables, f and g? 

From its definition we have

Var(f+g) = <(f+g)^2> - <f+g>^2,

and using the linearity of the expectation and separating the f, g and cross terms we get

Var(f+g) = <f^2>-<f>^2 + <g^2>-<g>^2 + 2(<f*g>-<f>*<g>)

= Var(f) + Var(g) + 2(<f*g>-<f>*<g>).



The covariance of f and g, Cov(f,g), is defined to be

<f*g>-<f>*<g>,



so that we get 

Var (f+g) = Var(f) + Var(g) + 2 Cov(f,g). 

Notice that the covariance of independent variables is 0. And so, when f and g are 
independent, we get 

Var(f+g) = Var(f) + Var(g). 

This innocent looking statement is actually very curious and important. We have here a linear 
type property of a quadratic function of f and g. It is the basis of the results we will need, which 
concern sums and averages of many independent random variables. 
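
The sum rule for variances can be checked on a small example. This sketch assumes a uniform space of 3-bit strings, with one dependent pair (the number of 1's and the first bit) and one independent pair (the first and last bits); all names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Assumption for illustration: uniform sample space of all 3-bit strings.
space = ["".join(bits) for bits in product("01", repeat=3)]
p = {q: Fraction(1, len(space)) for q in space}


def expect(f):
    return sum(f(q) * p[q] for q in space)


def variance(f):
    return expect(lambda q: f(q) ** 2) - expect(f) ** 2


def covariance(f, g):
    """Expectation of the product minus the product of the expectations."""
    return expect(lambda q: f(q) * g(q)) - expect(f) * expect(g)


def total(f, g):
    """The random variable f+g."""
    return lambda q: f(q) + g(q)


def count(q):                   # number of 1's
    return q.count("1")


def first(q):                   # first bit
    return int(q[0])


def last(q):                    # last bit
    return int(q[-1])


# Dependent pair: the covariance term is needed.
print(covariance(count, first))                                   # 1/4
print(variance(total(count, first)),
      variance(count) + variance(first) + 2 * covariance(count, first))   # 3/2 3/2

# Independent pair: the covariance vanishes and the variances simply add.
print(covariance(first, last))                                    # 0
print(variance(total(first, last)) == variance(first) + variance(last))   # True
```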

7.3 Comments 

A superficial look at the preceding suggests that we have made lots of definitions but gotten 
essentially nowhere with them. 

Actually we have gotten very far, as we shall soon see. 

But you must realize that in fields like probability, words are introduced that suggest
meanings which they may not actually have. For example, we have introduced the concept of
the expectation value of a random variable. Despite the name, it does not necessarily mean that it is
what we expect the variable to be, or that it has any meaning at all.

The expectation of a random variable is its average over your sample space with each point q
weighted by its specified probability, p(q). This may or may not mean anything at all. For example, an
indicator variable for an event always takes the values 0 and 1. Yet its expectation value is the probability
of the event, which is the sum over the points q of S of p(q)*I(E), and that is usually neither 0 nor 1.

On the other hand, if the variance of a random variable f is sufficiently small, the random
variable will be near the expected value at most points of the sample space.

If the actual case we are interested in is typical of the points in the sample space, the expected
value of f does tell us what to expect from f, when the variance of f is small enough.

In order to draw conclusions from probability theory, we will need one more tool, which tells
us how we can draw conclusions about the meaning of the expectation of f from information about its
variance.

Our goal here is to be able to analyze the behavior of situations that can be described as long 
sequences of independent (or mostly independent) random variables. 

In particular we will be interested in the situation in which we have a long string of bits which
contains a relatively small number of corrupted bits, and also in properties of randomly chosen bit
strings of a given length.



What we can say in such situations depends on what assumptions we can make about them.

If we are concerned with a sequence of length N of independent 0-1 valued random variables,
each with the same probability of being 1 (these are called identically distributed independent
Boolean random variables) then we can make very strong statements about their sum and their
average. We can in that case determine the probability that their sum or average takes on any value.

As we relax the conditions here, so that the variables do not have to all be identical, and can
take on other values, and can have limited dependence on one another, we can say somewhat less, but
for a wide variety of circumstances there is a similar conclusion, called the central limit theorem.
It has many variations based on varying assumptions about the situation modeled, usually with
roughly the same conclusion.

There is a weaker statement called the "law of large numbers" which is sufficient for our
purposes here. It states that if you choose a value from a set of independent identically distributed
random variables many times, the proportion of the time that you get any particular value will be
close to the probability that your random variable takes on that value, almost all the time, as your
number of picks increases. (Actually there are several laws of large numbers.)

This statement can serve as a justification for interpreting the probability of an event as the 
proportion of the time the event would occur, if you kept putting yourself in the situation modeled by 
your sample space, over and over again, independently. 

The rest of the chapter consists of a description of our last tool, and then some of the 
consequences that can be deduced from all this, about sequences of random variables. We leave you 
with some problems that can be solved by clever use of the ideas we have discussed so far, namely: 

1. the expectation of a sum of several random variables is always the sum of their
expectations.

2. the expectation of a product of independent random variables is the product of their
expectations.

7.4 The uses of the variance. Tchebychev's inequality 

The definition of the variance of f is equivalent to its interpretation as the mean square of f's
deviation from its mean.

However, it can be used to put an upper limit on the probability that a random variable takes
on values with deviations greater than xσ for any x greater than 1.

The simplest and worst such bound is called Tchebychev's inequality.
It is an answer to the question: how much of the probability of a variable can correspond to a
deviation greater than xσ?

You can see right away that if you want to minimize the mean square deviation, and have a given
amount, z, of probability with deviation greater than xσ, you are best off having all the probability
whose deviation is less than xσ have deviation 0, and the rest have deviation just barely greater than xσ.

The mean square deviation in that case will then be z(xσ)^2, which can be at most σ^2.

We deduce from this statement that the probability, z, that the deviation is greater than xσ
is at most 1/x^2, and this is Tchebychev's inequality.

This inequality tells us that the probability that your random variable differs from its mean by more
than xσ is at most 1/x^2.
When σ is very small, this statement implies that your random variable is close to its mean over a
portion of your sample space that has high probability.

And that kind of statement, when σ of your variable goes to 0 as the length of our bit string
increases, is called a Law of Large Numbers.
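
To see how loose or tight this bound is in a concrete case, here is a sketch that assumes f is the number of 1's in a uniformly random 10-bit string (an example chosen for illustration). It compares the exact probability of a deviation greater than x standard deviations with the bound 1/x^2, working with squared deviations so that no square roots are needed.

```python
from fractions import Fraction
from math import comb

# Assumption for illustration: f = number of 1's in a uniformly random
# 10-bit string, so that p(f=k) = C(10,k) / 2^10.
N = 10
pmf = {k: Fraction(comb(N, k), 2 ** N) for k in range(N + 1)}

mean = sum(k * w for k, w in pmf.items())
var = sum((k - mean) ** 2 * w for k, w in pmf.items())    # sigma squared


def tail_prob(x):
    """Exact probability that |f - mean| > x*sigma, tested via squares."""
    return sum(w for k, w in pmf.items() if (k - mean) ** 2 > x * x * var)


def bound(x):
    """The upper limit 1/x^2 from the inequality."""
    return 1 / Fraction(x) ** 2


for x in (Fraction(3, 2), 2, 3):
    print(x, float(tail_prob(x)), float(bound(x)))
```

In this example the exact tail probabilities are far below the bound, which is consistent with calling it the simplest and worst bound of its kind.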

7.5 Some Conclusions. 

Suppose first that we have N independent variables each having values of 0 and 1 only, and each
with a probability p of being 1.

The sum of these variables will be k when k of them are 1 and N-k are 0. 

Every specific way that this happens has probability given by the product of the
probabilities of each variable being what it is, and that will be p^k (1-p)^(N-k) since the variables
are all independent of one another.

There are C(N,k) different ways of this happening and these are mutually exclusive, so
that their probabilities add, and the probability that their sum is k is p^k (1-p)^(N-k) C(N,k).

Thus, in this case we deduce:

p(Sum=k) = p(Average=k/N) = p^k (1-p)^(N-k) C(N,k).

This is called a binomial distribution and its histogram goes up and comes down with a 
bell shape, for large N and fixed p. 
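
The counting argument behind the binomial formula can be replayed directly. The sketch below, with N=5 and p=1/3 chosen arbitrarily, compares the closed formula against a brute-force sum over every 0-1 sequence, where each sequence gets the product of the probabilities of its individual values.

```python
from fractions import Fraction
from itertools import product
from math import comb


def binomial_pmf(N, k, p):
    """p(Sum=k) = C(N,k) * p^k * (1-p)^(N-k)."""
    return comb(N, k) * p ** k * (1 - p) ** (N - k)


def brute_force_pmf(N, k, p):
    """Add the probabilities of every 0-1 sequence of length N with sum k."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=N):
        if sum(bits) == k:
            weight = Fraction(1)
            for bit in bits:                 # independence: multiply per variable
                weight *= p if bit == 1 else 1 - p
            total += weight
    return total


# Assumption for illustration: N = 5 variables, each 1 with probability 1/3.
N, p = 5, Fraction(1, 3)
formula = [binomial_pmf(N, k, p) for k in range(N + 1)]
counted = [brute_force_pmf(N, k, p) for k in range(N + 1)]

print(formula == counted)            # True
print(sum(formula))                  # 1
print([str(w) for w in formula])
```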

We know from our earlier mutterings, that the mean of this distribution is the sum of the 
means of its individual summands, which are p, so that the mean of the sum is pN, and the mean 
of their average is p itself. 

Furthermore, since these variables are assumed to be independent, the variance of their sum is
the sum of their individual variances, p(1-p), or Np(1-p), and the variance of their average is
p(1-p)/N.

(Remember that the variance is quadratic, so that dividing the sum by N, which is the way we average
variables, divides the variance by N^2.)

This tells us, by Tchebychev's inequality, that the average of our N variables approaches
p except on a set whose probability approaches 0, and this is the "Weak Law of Large
Numbers."
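
The following sketch illustrates these statements with exact arithmetic. It assumes p=1/3 and a tolerance of 1/10, both chosen only for illustration. It first confirms the mean Np and variance Np(1-p) of the sum for N=10, then computes, for growing N, the exact probability that the average misses p by more than the tolerance, alongside the bound obtained from the variance of the average.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(N, k, p):
    """p(Sum=k) for N independent 0-1 variables, each 1 with probability p."""
    return comb(N, k) * p ** k * (1 - p) ** (N - k)


def tail_prob(N, p, eps):
    """Exact probability that the average k/N differs from p by more than eps."""
    return sum(
        binomial_pmf(N, k, p)
        for k in range(N + 1)
        if abs(Fraction(k, N) - p) > eps
    )


def variance_bound(N, p, eps):
    """Variance of the average, p(1-p)/N, divided by eps^2."""
    return p * (1 - p) / (N * eps ** 2)


# Assumptions for illustration: p = 1/3 and tolerance eps = 1/10.
p, eps = Fraction(1, 3), Fraction(1, 10)

N0 = 10
mean_sum = sum(k * binomial_pmf(N0, k, p) for k in range(N0 + 1))
var_sum = sum((k - mean_sum) ** 2 * binomial_pmf(N0, k, p) for k in range(N0 + 1))
print(mean_sum == N0 * p, var_sum == N0 * p * (1 - p))     # True True

for N in (10, 100, 1000):
    print(N, float(tail_prob(N, p, eps)), float(variance_bound(N, p, eps)))
```

For small N the bound exceeds 1 and says nothing, but both columns shrink as N grows, which is the content of the weak law.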



This statement provides a meaning of sorts, to the probability p of a single variable. It is also 
exactly the statement we will need in the next section. 

Similar results to all of those above hold when assumptions are weakened. Here all variables 
are assumed to be independent, and identically distributed. 

Thus, as long as the variables are independent and each of their variances is small compared
to the sum of the variances, it is possible to prove that the resulting probability for the average of the
variables is not unlike a binomial distribution, except it will have mean given by the average of the
p's and variance given by the sum of the individual variances divided by N^2.

This holds true for sums of random variables which take on more values than 0 and 1, and
even (with some changes) when there is a limited amount of dependence among the variables, as well.

Results of this kind are called central limit theorems. 

When we add a large number of independent and identically distributed 0-1 valued random
variables, and the sum of their means stays finite, we get a limiting case of the binomial distribution,
called a Poisson distribution.

When the mean approaches m, this distribution has the form

p(k) = exp(-m) m^k / k!.

Again, distributions can tend to this one with much weakened assumptions about them. 
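
A numerical comparison makes the limit visible. This sketch assumes m=2 and compares the binomial distribution with p=m/N against the Poisson form for k from 0 to 10; the function max_gap is a name invented here for the largest difference between the two.

```python
from math import comb, exp, factorial


def binomial_pmf(N, k, p):
    """p(Sum=k) = C(N,k) * p^k * (1-p)^(N-k)."""
    return comb(N, k) * p ** k * (1 - p) ** (N - k)


def poisson_pmf(k, m):
    """p(k) = exp(-m) * m^k / k!"""
    return exp(-m) * m ** k / factorial(k)


def max_gap(N, m, k_max=10):
    """Largest difference between binomial(N, m/N) and Poisson(m) for k = 0..k_max."""
    return max(
        abs(binomial_pmf(N, k, m / N) - poisson_pmf(k, m))
        for k in range(k_max + 1)
    )


# Assumption for illustration: the sum of the means is held at m = 2.
m = 2.0
for N in (10, 100, 10000):
    print(N, max_gap(N, m))
```

The printed gap shrinks as N grows with m held fixed, in line with the statement that the binomial distribution tends to the Poisson one.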

You should also realize that the facts about probability that we have noticed above, that the 
mean of a sum is the sum of the means in all cases, that the variance of the sum of independent 
variables is the sum of the variances, etc. are extremely useful tools for solving probabilistic 
questions, as you will see in the exercises below. 

Exercises: 

7.1 is buried above. Find it! Do it! 

7.2 Use a spreadsheet to plot (chart x y scatter) p(k/N) vs k for binomial distributions having
p=1/2 and p=1/3 for values of N from 5 to however high you can, on one sheet for each p.

7.3 Derive the formula for a Poisson distribution from the binomial distribution with p=m/N,
for finite k as N gets large.

7.4 Given a class of N students with independently chosen birthdays, deduce the probability that 
no two have the same birthday. This can be done by deducing a formula for it and taking logs 
and expanding them to approximate the answer. 

Another way to estimate the answer is by assuming you have a Poisson distribution of the 
number of pairs of students with the same birthday. How do these answers compare? 

7.5 a. Suppose you pick 2 numbers between 0 and 1 at random (that is, the probability of
picking a number in any interval of length d is d, for each). What is the expected value of their
difference? Hint: what is the answer if you pick 3 points at random on a unit circle?

b. Suppose you pick 2 distinct numbers at random between 1 and N. What is the expected value 
of their difference? Hint: compare with part a. 

c. Suppose you pick k distinct numbers at random out of N? same question. 

7.6 Suppose k persons get on an elevator in the basement of an office building one morning, and
they get out independently, each with equal probability of emerging at each of the m floors at which
they might get out. Find the expected number of floors at which the elevator does not stop, because
nobody gets out.
Evaluate for m=11 and k=10 and vice versa.



